Introduction to Text as Data

Chris Bail
Duke University

Harold Lasswell

“We may classify references into categories,” wrote Lasswell in 1938, “according to the understanding which prevails among those who are accustomed to the symbols. References used in interviews may be quantified by counting the number of references which fall into each category during a selected period of time (or per thousand words uttered).”
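The counting procedure in this quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category dictionary, its word lists, the sample interview text, and the function name category_rates are all hypothetical and do not come from Lasswell's work.

```python
import re
from collections import Counter

# Hypothetical categories: each maps to the words treated as a
# "reference" to that category. Real dictionaries are far larger.
CATEGORIES = {
    "economy": {"jobs", "wages", "taxes"},
    "security": {"war", "army", "threat"},
}

def category_rates(text, categories=CATEGORIES):
    """Count references per category and scale to a rate per 1,000 words."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    total = len(tokens)
    raw = {name: counts[name] for name in categories}
    rates = {
        name: (1000 * raw[name] / total if total else 0.0)
        for name in categories
    }
    return raw, rates

# Made-up interview excerpt, used only to exercise the function.
interview = "Taxes and wages matter more than war. Jobs, jobs, jobs."
counts, rates = category_rates(interview)
print(counts)
print(rates)
```

Scaling by total words, as the quote suggests, makes interviews of different lengths comparable; raw counts alone would favor longer texts.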

Ahead of his time?

 

In 1935, at the age of 21, Lasswell was developing methods that tracked the association between word utterances and physiological reactions (e.g., pulse rate, electrical conductivity of the skin, and blood pressure).

Timeline of Quantitative Text Analysis

 

Time Activity
1934 Lasswell Produces First Key-Word Count
1934 Vygotsky Produces First Quantitative Narrative Analysis
1950 Gottschalk Uses Content Analysis to Track Freudian Themes
1950 Turing Applies AI to Text
1952 Berelson Publishes First Textbook on Content Analysis
1954 First Automatic Translation of Text (Georgetown Experiment)
1963 Mosteller and Wallace Analyze Federalist Papers
1965 Tomashevsky Further Formalizes Quantitative Narrative Analysis

Timeline of Quantitative Text Analysis

 

Time Activity
1966 Stone and Bales use mainframe computer to measure psychometric properties of text at RAND
1980 Decline of Chomskyan Formalism; NLP is Born
1980 Machine Learning is Applied to NLP
1981 Weintraub counts parts of speech
1985 Schrodt Introduces Automated Event Coding
1986 Pennebaker develops LIWC
1989 Franzosi brings Quantitative Narrative Analysis to Social Science

Timeline of Quantitative Text Analysis

 

Time Activity
1998 First Topic Models Developed
1998 Mohr conducts first Quantitative Analysis of Worldviews
1999 Bearman et al. apply Network Methods to Narratives
2001 Blei et al. develop LDA
2003 MALLET created
2005 Quinn et al. analyze political speeches using topic models
2010 King/Hopkins Bring Topic Models into mainstream
2010 Tools for Text Workshop at Washington

STRENGTHS OF NEW TEXT DATA

BIG

ALWAYS ON

NON-REACTIVE

CAPTURES SOCIAL RELATIONSHIPS

WEAKNESSES OF NEW TEXT DATA

INCOMPLETE

INACCESSIBLE

NON-REPRESENTATIVE

DRIFTING

ALGORITHMICALLY CONFOUNDED

UNSTRUCTURED

SENSITIVE

ELITE/PUBLICATION BIAS

POSITIVITY BIAS

Exploring Text-Based Datasets

List of Public Text Datasets

Now YOU Try It  

1) Pair up with your neighbor.
2) Introduce yourselves.
3) Pick a dataset from the list in the previous section, or another one that you hope to analyze after this course.
4) Identify at least three strengths and three weaknesses of the dataset, drawing on this introduction or other sources.

The Future of Digital Trace Data